Abstract: For parallel breadth-first search (BFS) algorithms
on large-scale distributed memory systems, communication often
costs significantly more than arithmetic and limits the scalability
of the algorithm. In this paper we significantly reduce the
communication cost in distributed BFS by compressing and
sieving the messages. First, we leverage a bitmap compression
algorithm to reduce the size of messages before communication.
Second, we propose a novel distributed directory algorithm, cross
directory, to sieve the redundant data in messages. Experiments
on a 6,144-core SMP cluster show that our algorithm outperforms
the baseline implementation in Graph500 by 2.2 times, reduces
its communication time by 79.0%, and achieves a performance
rate of 12.1 GTEPS (billion edge visits per second).

I. Introduction 

Recently, graphs have been extensively used to abstract complex
systems and interactions in emerging "big data" applications,
such as social network analysis, the World Wide Web, biological
systems and data mining. With the increasing growth
in these areas, petabyte-sized graph datasets are produced for
knowledge discovery [1], [2], which can only be processed
by distributed machines; benchmarks, algorithms and runtime
systems for distributed graphs have gained much popularity in
both academia and industry [3], [4], [5], [6]. One of the most
widely used graph-searching algorithms is breadth-first search
(BFS), which serves as a building block for a great many
graph algorithms such as minimum spanning tree, betweenness
centrality, and shortest paths [7], [8], [9], [10].

Implementing a distributed BFS with high performance,
however, is a challenging task because of its expensive
communication cost [11]. Generally, algorithms have two
kinds of costs: arithmetic and communication. For distributed
algorithms, communication often costs significantly more than
arithmetic. For example, on a 512-node cluster, the baseline
BFS algorithm in Graph 500 spends about 70% of its time on
communication during its traversal of a scale-free graph with
8 billion vertices (Figure 1). Therefore the most critical task in
a distributed BFS algorithm is to minimize its communication.

Fig. 1. Time breakdown of a baseline distributed BFS in a weak scaling
experiment that uses a fixed problem size per node (each node has about
16M vertices). Horizontal axis: number of nodes.

TABLE I
Comparison of various approaches for reducing
communication cost in distributed BFS.

Approach                                    Category
Two-dimensional partitioning [4], [5]       algorithm
Bitmap & sparse vector [3]                  data structure
PGAS with communication coalescing [12]     runtime
This work: compression & sieve              data structure

Several different approaches have been proposed to optimize
communication in distributed BFS (Table I): using two-dimensional
partitioning of the graph to reduce communication
overhead [4], [5], using bitmap or sparse vector to reduce
the size of messages [3], or applying communication
coalescing in a PGAS implementation to minimize message
overhead [12]. These approaches attack the problem from
different angles: algorithm, data structure and runtime. In this
paper, we focus on reducing the size of communication
messages (the optimization of data structures). The main
techniques we use are compression and sieve. Overall, we
make the following contributions:

• By compressing the messages, we reduce the communication
time by 52.4% and improve the overall performance
by 1.7x compared to the baseline BFS algorithm.
• By sieving the messages with a novel distributed directory
before compression, we further reduce the communication
time by 55.9% and improve the performance
by another 1.3x, achieving a total 79.0% reduction in
communication and a 2.2x performance improvement over
the baseline implementation.
• We implement and analyse several compression methods
for bitmap compression. Our experiments show the space-time
tradeoff of different compression methods.

In the next section we introduce the problem with an
example. Section III describes the baseline BFS algorithm.
Section IV and Section V describe our BFS algorithms
with compression and sieve. The analysis and experimental
results are presented in Section VI and Section VII, followed
by related works and concluding remarks in Section VIII and
Section IX.

Fig. 2. The operation of BFS on an undirected graph. The frontier $f$ is
represented as a vector. The vertices are distributed as
$P_1$: {0,1}, $P_2$: {2,3}, $P_3$: {4,5}, $P_4$: {6,7}.

TABLE II
The size of the frontier, represented as bitmap or sparse
vector, at each level of BFS of a scale-free graph with 1.6
billion vertices. For sparse vector, each vertex is represented
as a 64-bit number.

II. Motivation 

We start with an example illustrating the breadth-first search
(BFS) algorithm. Given a graph $G = (V, E)$ and a distinguished
source vertex $s$, breadth-first search systematically
explores the edges of $G$ to "discover" every vertex that is
reachable from $s$. In Figure 2 the source vertex is painted
black when the algorithm begins. Then it explores its adjacent
vertices, 3, 5 and 2, and paints them black. The exploration
goes on until all vertices are visited. Vertices discovered for the
first time are painted black; previously discovered vertices are
painted solid grey; vertices yet to be discovered are painted grey
with a black edge. The frontier $f$ of the graph is the set of
vertices that are discovered for the first time.

For distributed BFS, the vertices as well as the frontier
are divided among processors: $P_1$: {0,1}, $P_2$: {2,3},
$P_3$: {4,5}, $P_4$: {6,7}. The global information of the
frontier can only be retrieved through communication. $P_1$
in this example only "owns" the information of whether
vertices 0 and 1 are visited. If it wants to know whether vertex
2 is visited, it needs to ask $P_2$ for this information. The
common way to update the global $f$ is to use an MPI collective
communication such as ALLGATHER at the end of each level
[4], [5].

Level    #Vertices     Bitmap     Sparse vector
1        2             196.9MB    16B
2        20842         196.9MB    162.8KB
3        235274348     196.9MB    2.0GB
4        1377666413    196.9MB    10.2GB
5        38582585      196.9MB    294.4MB
6        88639         196.9MB    692.4KB
7        211           196.9MB    1.69KB
Total    1651633040    1.4GB      12.4GB

The most critical task for distributed BFS is to reduce the
size of the frontier, which directly influences the size of the
messages communicated. To reduce it, a bitmap or a sparse vector
is commonly used to represent the frontier. A bitmap uses a vector
of $|V|$ bits to represent the frontier, each bit of the vector
representing a vertex: 1 means it is included in the frontier,
0 means it is not. A sparse vector includes the frontier vertices
only, each represented using 64 bits. For graphs of diameter
$d$, bitmap is generally better when $d < 64$: the bitmap sends
$|V|$ bits at each of the $d$ levels, while the sparse vector sends
64 bits for every visited vertex, up to $64|V|$ bits in total.
Table II provides an example of the size of the frontier
represented as bitmap or sparse vector, for a scale-free graph
of 1.6 billion vertices. In this case, for $d = 7$, the total size
of messages using bitmap is 1.4 GB, much less than the sparse
vector's 12.4 GB. Despite the huge space saved by bitmap, two
problems remain:

• The first problem is that a bitmap needs to contain all the
vertices to keep the position information of each vertex.
For the above example, to represent 2 vertices at level 1,
the size of the bitmap frontier is still 196.9MB, where
most of the elements are zero. Fortunately, these zeros
can be condensed. We leverage lossless compression to
reduce the size of the bitmap.

• The other problem is the expensive broadcast cost of
the ALLGATHER collective communication, which broadcasts
all vertices to all processors. In fact, each processor
needs only a small fraction of the frontier. For example,
in Figure 2(b), $P_2$ does not need to send the information
of vertex 2 to $P_4$, because vertex 2 does not have a direct
edge connecting to the vertices of $P_4$. We propose a
distributed directory to sieve the bitmap vectors before
compression, further reducing the message size.
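
The size comparison behind Table II can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are ours, and it assumes the graph has as many vertices as the BFS visits in total (the "Total" row of Table II), which the text only gives approximately as 1.6 billion.

```python
def bitmap_bytes(num_vertices):
    """Bytes needed for one bit per vertex, rounded up to whole bytes."""
    return (num_vertices + 7) // 8


def sparse_vector_bytes(frontier_size, bytes_per_vertex=8):
    """Bytes needed when each frontier vertex is stored as a 64-bit id."""
    return frontier_size * bytes_per_vertex


# Frontier sizes per BFS level, copied from Table II.
TABLE_II_FRONTIERS = [2, 20842, 235274348, 1377666413, 38582585, 88639, 211]

# Assumption: the vertex count equals the total number of visited vertices.
NUM_VERTICES = sum(TABLE_II_FRONTIERS)

# The bitmap is sent in full at every level; the sparse vector only
# carries the vertices of that level.
bitmap_total = len(TABLE_II_FRONTIERS) * bitmap_bytes(NUM_VERTICES)
sparse_total = sum(sparse_vector_bytes(k) for k in TABLE_II_FRONTIERS)

MIB = 1024 ** 2
GIB = 1024 ** 3
print(f"bitmap per level: {bitmap_bytes(NUM_VERTICES) / MIB:.1f} MiB")
print(f"bitmap total:     {bitmap_total / GIB:.2f} GiB")
print(f"sparse total:     {sparse_total / GIB:.2f} GiB")
```

Under this assumption the bitmap wins overall, but at levels 1, 2, 6 and 7 the sparse vector is far smaller, which is the redundancy that compression targets.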



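Algorithm 1: A baseline distributed BFS

Input: s: source vertex id
1  $f(s) \leftarrow s$;
2  foreach processor $P_i$ in parallel do
3      while $f \neq \emptyset$ do
4          $t_i \leftarrow A_i \otimes f$;
5          $t_i \leftarrow t_i \odot \overline{\pi_i}$; $\pi_i \leftarrow \pi_i + t_i$;
6          $f_i \leftarrow t_i$;
7          $f \leftarrow$ ALLGATHERV($f_i$, $P_i$);


Algorithm 2: Distributed BFS with compression.

Input: s: source vertex id
1  $f(s) \leftarrow s$;
2  foreach processor $P_i$ in parallel do
3      while $f \neq \emptyset$ do
4          $t_i \leftarrow A_i \otimes f$;
5          $t_i \leftarrow t_i \odot \overline{\pi_i}$;
6          $\pi_i \leftarrow \pi_i + t_i$; $f_i \leftarrow t_i$;
7          $f_i' \leftarrow$ Compress($f_i$);
8          $f' \leftarrow$ Allgatherv($f_i'$, $P_i$);
9          $f \leftarrow$ Uncompress($f'$);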



III. Baseline BFS with Bitmap as Frontier 

A. BFS Described in Linear Algebra 

Let $A$ denote the adjacency matrix of the graph $G$, $f_{L_k}$
denote the frontier at level $k$, and $\pi_k = \bigcup_{i=1}^{k} f_{L_i}$
denote the visited information of previous frontiers. The exploration of
level $k$ in BFS is algebraically equivalent to a sparse matrix
vector multiplication (SpMV):
$f_{L_{k+1}} \leftarrow A^T \otimes f_{L_k} \odot \overline{\pi_k}$
(we will omit the transpose and assume that the input is
pre-transposed for the rest of this section). For example, traversing
from level one (Figure 2(a)) to level two (Figure 2(b)) is
one such multiplication.

The symbol $\otimes$ denotes the matrix-vector multiplication
operation, $\odot$ denotes element-wise multiplication,
$(a_1, a_2, \ldots, a_n)^T \odot (b_1, b_2, \ldots, b_n)^T =
(a_1 b_1, a_2 b_2, \ldots, a_n b_n)^T$, and the overline represents the
complement operation. In other words, $\overline{v_i} = 0$ for
$v_i \neq 0$ and $\overline{v_i} = 1$ for $v_i = 0$.

In Figure 2 BFS starts from vertex $v_0$, thus $f_{L_0} = \{v_0\}$,
$f_{L_1} = \{v_2, v_3, v_5\}$, $f_{L_2} = \{v_4, v_6, v_7\}$,
$f_{L_3} = \{v_1\}$. If we use a vector of size $n$ to represent the
corresponding frontier $f_{L_k}$, then, for example,
$f_{L_1} = (0,0,1,1,0,1,0,0)$. This algorithm becomes deterministic
with the use of the (select, max)-semiring, because the parent is
always chosen to be the vertex with the highest label.
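
To make the update rule concrete, here is a minimal single-process sketch in Python. It tracks only the boolean frontier, not the parent labels of the (select, max)-semiring. The edge list is hypothetical: it was chosen so that the levels match the frontiers listed above, and is not taken from Figure 2.

```python
# Hypothetical undirected edges that reproduce the frontiers in the text.
EDGES = [(0, 2), (0, 3), (0, 5), (2, 4), (3, 6), (5, 7), (4, 1)]


def build_adjacency(n, edges):
    """Dense symmetric 0/1 adjacency matrix (so the transpose is a no-op)."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def bfs_levels(adj, source):
    """Return the frontier vector of every BFS level, starting at `source`."""
    n = len(adj)
    frontier = [0] * n
    frontier[source] = 1
    visited = list(frontier)  # pi: union of all frontiers so far
    levels = [frontier]
    while True:
        # A (x) f over the boolean semiring: v is reached if any
        # neighbour u of v is in the current frontier.
        reached = [int(any(adj[v][u] and frontier[u] for u in range(n)))
                   for v in range(n)]
        # Element-wise product with the complement of pi.
        frontier = [r & (1 - p) for r, p in zip(reached, visited)]
        if not any(frontier):
            return levels
        visited = [p | x for p, x in zip(visited, frontier)]
        levels.append(frontier)


adj = build_adjacency(8, EDGES)
levels = bfs_levels(adj, 0)
for k, vec in enumerate(levels):
    print(f"f_L{k} = {vec}")
```

A dense matrix is used only to keep the sketch short; a real implementation would use a sparse format such as CSR.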

B. Baseline BFS 

Algorithm 1 describes the baseline BFS. Each loop block
(starting in line 3) performs a single level traversal. $f$
represents the current frontier, which is initialized as an empty
bitmap; $t$ is a bitmap that holds the temporary parent
information for that iteration only; $\pi$ is the visited information of
previous frontiers. The computational steps (lines 4, 5 and 6) can be
efficiently parallelized with multithreading. For the SpMV
operation in line 4, the matrix data is naturally split into pieces
for multithreading. At the end of each loop, ALLGATHER
updates $f$ with MPI collective communication.

TABLE III
A WAH compressed bitmap.

16 bits         1000000000000000
3-bit groups    100 000 000 000 000
WAH             0100 1100 0000

IV. BFS with Compression 

For large graphs, the communication time of distributed 
BFS algorithms can take as much as seventy percent of the 
total execution time. To reduce it, we need to reduce the size of 
the messages. One simple way is to use lossless compression, 
trading computation for bandwidth. 

Algorithm 2 describes the distributed BFS with compression.
The differences between Algorithm 2 and Algorithm 1 are in lines
7 and 9. At line 7 the frontier vector $f_i$ is first compressed into
$f_i'$ before communication. At line 9 $f'$ is uncompressed back
to $f$ after communication.

We use word-aligned hybrid (WAH) [13] for the Compress and
Uncompress functions, as WAH is fast and well suited for
bitmap compression. Table III shows the WAH compressed
representation of 16 bits. In WAH, there are three types of
words: literal words, fill words and active words. The most
significant bit of a word is used to distinguish between a literal
word (0) and a fill word (1). An active word stores the last
few bits. We assume that each computer word contains 4 bits
and all fill bits are 0 in this example. Under this assumption,
each literal word stores 3 bits from the bitmap, and each
fill word represents a multiple of 3 bits. The second line in
Table III shows the bitmap as 3-bit groups. The last line shows
the WAH words. The first two words are regular words: the
first is a literal word, and the second a fill word. The fill
word 1100 indicates a 0-fill that is 4 words long (containing 12
consecutive bits). Note that the fill word stores the fill length
as 4 rather than 12. The third word is the active word; it stores
the last few bits that could not be stored in a regular word.
For sparse bitmaps, where most of the bits are 0, a WAH
compressed bitmap would consist of pairs of a fill word and
a literal word [13].
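
The 4-bit example of Table III can be replayed with a small encoder. This is a simplified sketch under the same assumptions as the example (only 0-fills, so a fill word is the flag bit followed by the run length), not the full WAH format of [13]; the function names and the string representation of words are ours.

```python
def wah_encode(bits, word_bits=4):
    """Encode a '0'/'1' string into WAH-style words (0-fills only)."""
    group = word_bits - 1                 # payload bits per regular word
    max_run = (1 << group) - 1            # longest run one fill word can hold
    n_full = len(bits) // group
    words = []
    run = 0                               # pending all-zero groups

    def flush():
        nonlocal run
        if run:
            words.append("1" + format(run, f"0{group}b"))
            run = 0

    for g in range(n_full):
        chunk = bits[g * group:(g + 1) * group]
        if chunk == "0" * group:
            run += 1
            if run == max_run:
                flush()
        else:
            flush()
            words.append("0" + chunk)     # literal word
    flush()
    # Active word: leftover bits that do not fill a whole group, zero padded.
    words.append(bits[n_full * group:].ljust(word_bits, "0"))
    return words


def wah_decode(words, n_bits, word_bits=4):
    """Inverse of wah_encode; n_bits is the original bitmap length."""
    group = word_bits - 1
    out = []
    for w in words[:-1]:
        if w[0] == "1":
            out.append("0" * (group * int(w[1:], 2)))
        else:
            out.append(w[1:])
    out.append(words[-1][:n_bits % group])
    return "".join(out)


example = "1000000000000000"
encoded = wah_encode(example)
print(" ".join(encoded))
```

For the 16-bit input of Table III this produces the three words 0100, 1100 and 0000. Passing `n_bits` to the decoder is a shortcut of this sketch; a production encoder would record the number of valid bits in the active word instead.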

Fig. 3. Three different ways of communication. Panels include
(b) Algorithm 2 and (d) Graph example.



Other lossless compression methods include run-length
encoding, Huffman coding, LZ77 [14], or more dedicated
bitmap compression methods such as byte-aligned bitmap
compression (BBC) [15] and position list word aligned hybrid
(PLWAH) [16]. There is a space-time tradeoff among these
compression schemes. Compared to WAH, LZ77 is slower but
has a better compression ratio. The benefit of compression
depends on many factors such as compression ratio, sparsity of
the messages, compression speed and network bandwidth. The
best compression scheme cannot be determined beforehand,
so we use experiments to analyse these tradeoffs. Details are
presented in Section VII.
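
As a rough illustration of the space-time tradeoff, the following sketch compresses a sparse bitmap with Python's standard `zlib` module at three levels. The bitmap size and the positions of the set bits are arbitrary choices for the example, and the sizes it prints say nothing about the results in the paper.

```python
import zlib


def make_sparse_bitmap(num_bits, set_bits):
    """Build a bitmap of num_bits bits with the given positions set to 1."""
    bitmap = bytearray((num_bits + 7) // 8)
    for b in set_bits:
        bitmap[b // 8] |= 1 << (b % 8)
    return bytes(bitmap)


def compressed_sizes(bitmap):
    """Compress at three zlib levels and return the compressed sizes."""
    levels = {
        "best speed": zlib.Z_BEST_SPEED,
        "default": zlib.Z_DEFAULT_COMPRESSION,
        "best compression": zlib.Z_BEST_COMPRESSION,
    }
    sizes = {}
    for name, level in levels.items():
        packed = zlib.compress(bitmap, level)
        # Lossless: decompression must return the original bitmap.
        assert zlib.decompress(packed) == bitmap
        sizes[name] = len(packed)
    return sizes


# Example: a 1 Mbit frontier with a handful of discovered vertices.
bitmap = make_sparse_bitmap(1 << 20, [3, 4096, 70000, 500000, 1000000])
sizes = compressed_sizes(bitmap)
print(f"raw: {len(bitmap)} bytes")
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Timing each call (for example with `time.perf_counter`) alongside the sizes gives the two axes of the tradeoff.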



V. BFS with Compression and Sieve 

The message size in the communication is reduced after 
compression. But there is still room for improvement. To 
achieve a better compression ratio, we can use a directory 
to sieve the bitmap, making it even sparser for compression. 

In this section we propose a distributed directory, cross
directory, as a sieve to reduce the number of messages sent
to each processor. We first introduce the data structure
of the cross directory in subsection V-A, then describe our BFS
algorithm with compression and sieve in subsection V-B.

A. Cross Directory 

The problem with collective communication like ALLGATHERV
is that it sends all frontier vertices to all the processors,
regardless of whether a vertex is meaningful to each processor
(just like snoopy cache coherence algorithms, where all updates
are visible to all processors). Take a look at Figure 3(a):
after the ALLGATHER communication, each processor actually
gets all the frontier vectors. In fact, each processor needs
only a small fraction of the frontier, and this fraction can be
determined before communication. For example, in Figure 3(d),
$P_4$ only needs $v_3$ from $P_2$. This means $P_2$ does not need
to send the information of $v_2$ to $P_4$, because $v_2$ does not have
a direct edge connecting to the vertices of $P_4$.

Fig. 4. The cross directory data structure for processor i.

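To explain this in algebra, we first partition the matrix $A$
into $p$ block-rows, then partition each block-row $A_i$ into $p$
sub-blocks:

$$A \otimes f = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \end{pmatrix} \otimes f \qquad (1)$$

$$A_i = \begin{pmatrix} A_{i,1} & A_{i,2} & \cdots & A_{i,p} \end{pmatrix} \qquad (2)$$

To calculate $f_4 = \sum_{i=1}^{p} A_{4,i} \otimes f_i$, consider the term

$$A_{4,2} \otimes f_2 = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \otimes \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (3)$$

Because $a_{0,0}$ and $a_{1,0}$ of $A_{4,2}$ are always zero (denote
$A_{i,j} = [a_{i,j}]_{m \times n}$), the contribution of $x_1$ will
always be zero. So $P_2$ does not need to send $x_1$ to $P_4$. We define
a data structure to record this information and use it to sieve
communication messages.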

We formally define the directory vector as follows: for each
item $v_k$ in vector $V_{i,j}$, $v_k$ is set to one if column $k$ in
$A_{i,j}$ contains at least one non-zero.

$$V_{i,j} = (v_1, v_2, \ldots, v_n), \quad \text{where } v_k = \begin{cases} 1, & \exists\, a_{i,k} = 1,\ i \in [1,m],\ k \in [1,n] \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

For the above example, $V_{4,2} = (0, 1)$ is sent to $P_2$ from $P_4$
during initialization. When traversing begins, $f_2$ is sieved into
$f_{2,4} = f_2 \odot V_{4,2} = (1, 1)^T \odot (0, 1)^T = (0, 1)^T$, so we
only send one vertex (in compressed bitmap format) back instead of
two. This "sieve effect" is where communication is reduced.
The cross directory of processor $P_i$ is defined as:

$$C_i = \{ V_{x,i} \text{ or } V_{i,x} \mid x = 1, 2, \ldots, p \} \qquad (5)$$

Besides a row of directory vectors $V_i = \{V_{i,y} \mid y = 1, 2, \ldots, p\}$,
$P_i$ owns a copy of the directory vectors
$\{V_{x,i} \mid x = 1, 2, \ldots, p\}$ in column $i$. The directory in the
column direction is established during initialization and used to provide
a local lookup for sieving (see Figure 4).

Figure 5 illustrates an example of communication with the cross
directory. The matrix is row-block partitioned among four
processors. $A_{2,1} \otimes f_1$ is done in five steps: $A_{2,1}$ needs
to get $f_1$ (step 1); $P_2$ then sends a request message to $P_1$
(step 2); $P_1$ checks its local copy of $V_{2,1}$ (step 3) and sieves
$f_1$ with the non-zero positions (step 4); then $P_1$ sends back a
sieved $f_1'$ (step 5). The sieved vector is very sparse and can be
represented as a sparse vector, reducing the communication cost.

Fig. 5. Communications in the directory-based BFS algorithm. Example of
multiplying $A_{2,1}$ with $f_1$ in five steps, between Processor 1 and
Processor 2.

Algorithm 3: Distributed BFS with sieving and compression.

Data: $f_i' = \{f'_{i,0}, f'_{i,1}, \ldots, f'_{i,n-1}\}$: send buffer;
      $g_i' = \{g'_{i,0}, g'_{i,1}, \ldots, g'_{i,n-1}\}$: receive buffer;
      $C_i$: cross directory for $P_i$.
1   $f(s) \leftarrow s$;
2   initialize $C_i$;
3   foreach processor $P_i$ in parallel do
4       while $f \neq \emptyset$ do
5           $t_i \leftarrow \sum_{j} A_{i,j} \otimes f_{i,j}$;
6           $t_i \leftarrow t_i \odot \overline{\pi_i}$;
7           $\pi_i \leftarrow \pi_i + t_i$; $f_i \leftarrow t_i$;
8           foreach $j \in [0, n)$ in parallel do
9               $f_{i,j} = f_i \odot V_{j,i}$;   /* sieving */
10              $f'_{i,j} \leftarrow$ Compress($f_{i,j}$);
11          $g_i' \leftarrow$ Alltoallv($f_i'$, $P_i$);
12          foreach $j \in [0, n)$ in parallel do
13              $f_{i,j} \leftarrow$ Uncompress($g'_{i,j}$);
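
The directory vector and the sieve step are small enough to show directly. The sketch below follows the column-based definition of the directory vector. The 2x2 block `A_42` is an assumed instance whose first column is all zero, as in the $P_2$/$P_4$ example, and the function names are ours.

```python
def directory_vector(block):
    """v_k = 1 if column k of the sub-block has at least one non-zero."""
    rows, cols = len(block), len(block[0])
    return [int(any(block[r][k] for r in range(rows))) for k in range(cols)]


def sieve(frontier_piece, directory):
    """Element-wise product: keep only frontier bits the receiver can use."""
    return [f & v for f, v in zip(frontier_piece, directory)]


# Assumed sub-block A_{4,2}: rows are the vertices of P4, columns those of P2.
A_42 = [[0, 1],
        [0, 0]]

V_42 = directory_vector(A_42)   # built by P4 and sent to P2 at initialization
f_2 = [1, 1]                    # both vertices of P2 are in the frontier
f_24 = sieve(f_2, V_42)         # what P2 actually sends to P4

print("V_42 =", V_42)
print("f_24 =", f_24)
```

Because the graph does not change during the traversal, the directory vectors are computed once and reused at every level.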

B. Sieve with Cross Directory 

Algorithm 3 is our directory-based algorithm with compression
and sieve. Based on Algorithm 1, Algorithm 3 first
sieves the frontier bitmap with the cross directory (line 9),
making it sparser; then it compresses the sieved bitmap (line
10) and sends it with ALLTOALLV (line 11); after receiving the
compressed bitmap, the original vector is restored by
decompression (line 13).
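
The effect of sieving on the exchanged data can be seen in a single-process simulation. The sketch below is not the MPI implementation: it leaves out compression, models the exchange with plain lists, and uses a hypothetical 8-vertex graph split over four "processors". It only counts how many set frontier bits cross processor boundaries with and without the sieve, and checks that both variants discover the same levels.

```python
# Hypothetical undirected edges; vertices 2k and 2k+1 belong to processor k.
EDGES = [(0, 2), (0, 3), (0, 5), (2, 4), (3, 6), (5, 7), (4, 1)]


def build_adjacency(n, edges):
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def simulate_bfs(adj, source, p, use_sieve):
    """Return (levels, set bits sent between different processors)."""
    n = len(adj)
    per = n // p                      # assumes p divides n

    def owned(i):
        return range(i * per, (i + 1) * per)

    # directory[i][j][k] = 1 if P_i has an edge to the k-th vertex of P_j.
    directory = [[[int(any(adj[r][j * per + k] for r in owned(i)))
                   for k in range(per)] for j in range(p)] for i in range(p)]
    visited = [0] * n
    visited[source] = 1
    pieces = [[visited[v] for v in owned(i)] for i in range(p)]
    levels = [[source]]
    bits_sent = 0
    while any(any(piece) for piece in pieces):
        next_pieces = []
        for i in range(p):
            received = []
            for j in range(p):
                if use_sieve:
                    msg = [x & d for x, d in zip(pieces[j], directory[i][j])]
                else:
                    msg = list(pieces[j])
                if i != j:
                    bits_sent += sum(msg)
                received.extend(msg)
            # Local SpMV on P_i, masked by the visited information.
            next_pieces.append([
                int(any(adj[v][u] and received[u] for u in range(n)))
                & (1 - visited[v]) for v in owned(i)])
        pieces = next_pieces
        frontier = [i * per + k for i in range(p)
                    for k in range(per) if pieces[i][k]]
        for v in frontier:
            visited[v] = 1
        if frontier:
            levels.append(frontier)
    return levels, bits_sent


adj = build_adjacency(8, EDGES)
plain_levels, plain_bits = simulate_bfs(adj, 0, p=4, use_sieve=False)
sieved_levels, sieved_bits = simulate_bfs(adj, 0, p=4, use_sieve=True)
print("levels:", sieved_levels)
print("set bits sent without sieve:", plain_bits)
print("set bits sent with sieve:   ", sieved_bits)
```

Fewer set bits in each message means a sparser bitmap, which is what improves the compression ratio in the real algorithm.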



This cross directory is inspired by Pinar and Hendrickson's
distributed directory [17] and Baker et al.'s assumed partition
algorithm [18]. In their work, the communication pattern
is dynamically determined and more general, while in our
case, the communication parties are static. So we store the
directory on both sides of the communication, and update
them synchronously on each side instead of sending the updated
directory over the network each time. Another difference lies
in the collective communication. In Baker et al.'s assumed
partition algorithm, point-to-point rendezvous communication
is used; we find that it can be replaced with a more efficient
ALLTOALLV. More generally, the cross directory is applicable
to matrix-vector multiplication when the following premises are
true: 1) the partition of the matrix is static so that the
communication parties are static; 2) the matrix remains unchanged
and the multiplication is repeated many times so that the cross
directory can be reused and its initialization cost can be neglected.
An example is the sum of the products of the same matrix
with different vectors ($\sum_i A x_i$).

C. Proof of Correctness 

In this subsection we prove the correctness of Algorithm 3
by proving its equivalence to Algorithm 1.

Lemma 5.1: $A_{i,j} \otimes f_j = A_{i,j} \otimes f_j \odot V_{j,i}$.

Proof: Let $X = (x_1, x_2, \ldots, x_n)^T = A_{i,j} \otimes f_j$,
$Y = (y_1, y_2, \ldots, y_n)^T = A_{i,j} \otimes f_j \odot V_{j,i}
= (x_1 v_1, x_2 v_2, \ldots, x_n v_n)^T$,
$V_{j,i} = (v_1, v_2, \ldots, v_n)^T$, and
$f_j = (z_1, z_2, \ldots, z_n)^T$. Denote $A_{i,j} = [a_{i,j}]_{m \times n}$,
then $x_k = \sum_{i=1}^{n} a_{k,i} z_i$. According to the definition of
the directory vector, if $v_k = 0$ then $a_{k,i} = 0$ for all
$i \in [1,n]$, hence $x_k = \sum_{i=1}^{n} a_{k,i} z_i = 0$,
so $y_k = x_k v_k = 0 = x_k$; if $v_k = 1$ then $x_k = x_k v_k = y_k$.
Thus $X = Y$. ∎

Lemma 5.2: $t_i = \sum_{j=1}^{n} A_{i,j} \otimes f_{i,j}$ in Algorithm 3
(line 5) is equivalent to $t_i = A_i \otimes f$ in Algorithm 1 (line 4).

Proof: In Algorithm 3, for all $j \in [1,n]$,
$f_{i,j} = f_i \odot V_{j,i}$. According to Lemma 5.1,
$\sum_{j=1}^{n} A_{i,j} \otimes f_{i,j}
= \sum_{j=1}^{n} A_{i,j} \otimes f_j \odot V_{j,i}
= \sum_{j=1}^{n} A_{i,j} \otimes f_j = A_i \otimes f = t_i$. ∎

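
The identity of Lemma 5.1 can also be checked mechanically on small cases. The sketch below enumerates every 0/1 block of a given size and every 0/1 input vector, using the column-based definition of the directory vector and boolean (or/and) arithmetic for the matrix-vector product. It is a sanity check on small instances, not a substitute for the proof; all names are ours.

```python
from itertools import product


def spmv(block, vec):
    """Boolean matrix-vector product: row r is 1 if any a[r][k] and x[k]."""
    return [int(any(a and x for a, x in zip(row, vec))) for row in block]


def directory_vector(block):
    """v_k = 1 if column k of the block has at least one non-zero."""
    rows, cols = len(block), len(block[0])
    return [int(any(block[r][k] for r in range(rows))) for k in range(cols)]


def lemma_holds(block, vec):
    """Check A (x) f == A (x) (f (.) V) for one block and one vector."""
    v = directory_vector(block)
    sieved = [x & d for x, d in zip(vec, v)]
    return spmv(block, sieved) == spmv(block, vec)


def check_all(rows=2, cols=3):
    """Exhaustively test all 0/1 blocks of the given shape and all vectors."""
    for flat in product([0, 1], repeat=rows * cols):
        block = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
        for vec in product([0, 1], repeat=cols):
            if not lemma_holds(block, list(vec)):
                return False
    return True


print("Lemma 5.1 holds on all 2x3 cases:", check_all(2, 3))
```

The check passes because a sieved-out entry $x_k$ only ever multiplies a column of zeros.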
VI. Algorithm Analysis 



In this section we analyse the communication
and space cost of the three algorithms in this paper.

A. Communication Cost 

We study the parallel BFS problem in the message passing
model of distributed computing: every processor has its own
local memory, and data exchange between processors is done
by message passing. The time taken to send a message between
any two processors can be modeled as $T(n) = \alpha + n\beta$,
where $\alpha$ is the latency (or startup time) per message,
independent of message size, $\beta$ is the transfer time per byte
(inverse of bandwidth), and $n$ is the number of bytes transferred [19].
This time cost model is generally used to model data movement
either between levels of a memory hierarchy or over a network
connecting processors. In this paper, we focus on the latter
case. To simplify the analysis, we assume the bandwidth cost is
much bigger than the latency cost ($n\beta \gg \alpha$), as the dataset
of distributed BFS is big; therefore $T(n)$ will be dominated by
the bandwidth cost $n\beta$. For a given network, $\beta$ is constant, so
the communication cost is in direct proportion to the message
size $n$. Let the communication volume of a processor, $V_i$, be the
size of all messages communicated on processor $P_i$ in an
algorithm. The communication volume of an algorithm is
defined as $V = \max\{V_i \mid i \in [1, p]\}$.

The communication volume of MPI collective communication
is derived from [20], [21]: for $p$ processors, when
each processor needs to broadcast a message of size $n/p$ to
the others, the communication volume of both allgather and
alltoall is $O(n)$. There are many algorithms for allgather, for
example, ring and recursive doubling [20]. The time taken
by these two algorithms is
$T_{ring} = (p - 1)\alpha + \frac{p-1}{p} n\beta$ and
$T_{rec\_dbl} = \log p \, \alpha + \frac{p-1}{p} n\beta$, respectively.
No matter which algorithm is used, the bandwidth cost is the same,
$\frac{p-1}{p} n\beta$.
In data-intensive applications like BFS, we assume the bandwidth
cost is much bigger than the latency cost, so the communication
volume is bounded by $O(n)$. The communication volume of
alltoall can be derived in the same manner [21].
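
Plugging numbers into the two allgather formulas shows why the latency term is neglected. The values of `alpha` and `beta` below are illustrative assumptions (1 microsecond latency, 5 GB/s bandwidth), not measurements from the experimental platform.

```python
import math


def t_ring(p, n_bytes, alpha, beta):
    """Ring allgather: (p - 1) * alpha + (p - 1) / p * n * beta."""
    return (p - 1) * alpha + (p - 1) / p * n_bytes * beta


def t_recursive_doubling(p, n_bytes, alpha, beta):
    """Recursive doubling allgather: log2(p) * alpha + (p - 1) / p * n * beta."""
    return math.log2(p) * alpha + (p - 1) / p * n_bytes * beta


# Assumed network parameters, for illustration only.
ALPHA = 1e-6          # latency per message, seconds
BETA = 1.0 / 5e9      # transfer time per byte, seconds
P = 512               # number of processors
N = 2 ** 30           # total bytes gathered (a 1 GB bitmap)

ring = t_ring(P, N, ALPHA, BETA)
rec_dbl = t_recursive_doubling(P, N, ALPHA, BETA)
bandwidth_term = (P - 1) / P * N * BETA

print(f"ring:               {ring:.6f} s")
print(f"recursive doubling: {rec_dbl:.6f} s")
print(f"bandwidth term:     {bandwidth_term:.6f} s")
```

With these assumed parameters the two algorithms differ only by the latency terms, which are well under one percent of the total, so halving the message size roughly halves the communication time.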

For graph $G(V,E)$, let $m = |E|$, $n = |V|$, and let $d$ be the
diameter of the graph. At each level of BFS, the communication
volume of allgather (Algorithm 1, line 7) is $O(n)$; the
algorithm will finish at level $d$. So the communication volume
of Algorithm 1 is $d \times O(n)$.

For Algorithm 2, let $C_i$ ($C_i > 1$) be the compression ratio
of the Compress function of Algorithm 2 (line 7) at level $i$, and
let $C = \frac{1}{d} \sum_{i=1}^{d} \frac{1}{C_i}$ ($C < 1$) be the
compression ratio factor.
The communication volume of Algorithm 2 is $Cd \times O(n)$.

For Algorithm 3, let $p$ be the number of processors,
$e = m/n$ be the average degree of a vertex, and $C'$ be the
compression ratio factor of Algorithm 3. The communication
volume of Algorithm 3 is $C'd \times O(n)$. After sieving, a vertex is
sent to at most $\min(e, p)$ processors in Algorithm 3 instead
of $p$ in Algorithms 1 and 2. Thus Algorithm 3's messages
contain fewer nonzeros than Algorithm 2's, which leads to a
higher compression ratio and a smaller $C'$ ($C' < C$).

B. Memory Consumption 

For Algorithm 1, the memory consumption of $f$ is $O(n)$;
$t_i$ and $\pi_i$ are $O(n/p)$. So the memory consumption of each
processor in Algorithm 1 is $O(n)$.

Compared to Algorithm 1, Algorithm 2 replaces $f$ with
$f'$, the memory consumption of which is at most that of
$f$, $O(n)$. So the memory consumption of each processor in
Algorithm 2 is also $O(n)$.

Compared to Algorithm 2, Algorithm 3 adds $V_i$, which
costs $O(n)$ memory. So the memory consumption of
Algorithm 3 is also bounded by $O(n)$.

VII. Experimental Results 

This section presents experimental results for the distributed 
BFS. 



TABLE IV
Experiment platform

System                   SMP Cluster
Number of nodes          512
Number of CPUs / node    2
Processor                Intel X5650
Number of cores          6
Number of threads        12
Core frequency           2.66 GHz
L1 cache size            384 KB
L2 cache size            1536 KB
L3 cache size            12 MB
Memory type              DDR3-1333
QPI speed                6.4 GT/s
Interconnect             Infiniband
Rate                     40 Gb/sec (4X QDR)

A. Experiment Setup 

Our performance results are collected on a 512-node multi-core
cluster system, connected by 40 Gb/s Infiniband. Each
node has an SMP architecture with two Xeon X5650 CPUs
(Westmere), which are connected through Intel QuickPath
Interconnect (QPI) at 6.4 GT/s. The Xeon X5650 has six
cores, each supporting simultaneous multithreading (SMT) with up
to two threads. Each node has 24GB of DDR3-1333 RAM. In
our experiments we use up to 512 nodes, or 6,144 cores. We use
gcc 4.3.4 and MPICH2 1.4.1 to
compile our algorithms. The GNU OpenMP library is used for
intra-node threading. See Table IV.

Our algorithms are based on the Graph 500 benchmark. Input
datasets are generated using synthetic Kronecker graphs [22],
which follow power law distributions: heavy tails for the
degree distribution; small diameters; and densification and
shrinking diameters over time. That means most vertices have
a small number of neighboring vertices and the graph is sparse.
The graph size is determined by two parameters, "Scale" and
"Edge factor", where the total number of vertices is
$N = 2^{Scale}$, and the number of edges is
$M = \text{edgefactor} \times N$. The
default edgefactor is 16. In order to save space, the adjacency
array (or list) representing the sparse graph is transformed into
compressed sparse row (CSR) or column (CSC) format. We focus on
the CSR-based BFS implementation in Graph 500. In order
to compare the performance of Graph 500 implementations
across a variety of architectures, a new performance metric
is adopted in Graph 500. Let time be the measured execution
time for running BFS. Let $m$ be the number of input
edge tuples within the component traversed by the search,
counting any multiple edges and self-loops. The normalized
performance rate, traversed edges per second (TEPS), is defined
as: TEPS = m/time.
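
The size parameters and the metric are simple enough to compute directly. The helper names below are ours; the `teps` example uses made-up numbers purely to show the units.

```python
def graph_size(scale, edgefactor=16):
    """Return (N, M): N = 2**scale vertices, M = edgefactor * N edges."""
    n = 2 ** scale
    return n, edgefactor * n


def teps(edges_traversed, seconds):
    """Traversed edges per second: m / time."""
    return edges_traversed / seconds


n33, m33 = graph_size(33)
print(f"scale 33: N = {n33:,} vertices, M = {m33:,} edges")

# Made-up example: 1e9 traversed edges in 0.5 s is 2 GTEPS.
print(f"{teps(1e9, 0.5) / 1e9:.1f} GTEPS")
```

At scale 33 this gives about 8.6 billion vertices, matching the "8 billion vertices" quoted in the introduction.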

Table V lists the different BFS algorithms tested in our
experiment.

TABLE V
Different BFS algorithms tested

Fig. 6. Weak scaling performance of different BFS algorithms. The
experiment uses a fixed problem size per node (each node has about 16M
vertices).

B. Experiment Results 

Figure 6 shows the weak scaling performance of our BFS
algorithms. We run this experiment on our 512-node SMP
cluster, with one process per SMP node. For intra-node
threading, we use the GNU OpenMP library. Algorithm 3
(DIR-WAH) outperforms all other algorithms and has the
best scalability. DIR-WAH achieves 1.21E+10 TEPS at scale
33 with 512 nodes, 1.33x faster than Algorithm 2 (WAH), and
2.24x faster than Algorithm 1 (BIT). We can see the benefits
of compression and sieve here: with compression, WAH is
1.69x faster than BIT; with sieve, DIR-WAH is another 1.33x
faster than WAH. The performance gap between DIR-WAH and BIT
becomes wider as the number of nodes increases. This is
because the larger the number of nodes is, the more a distributed
BFS algorithm depends on communication, and the more
benefits compression and sieve bring. We will see the
time breakdown in the next figure.

Figure 7 is the time breakdown of the algorithms in
Figure 6. "Traversing" time is the time spent on local computing;
"reducing" time is the time spent on an MPI reduction operation
to get the total vertex count of the frontier; "communication"
time is the time spent on communication; "compression &
sieve" time is the time spent on compression and sieve.
For all three algorithms, as the number of nodes increases,
"communication" times increase exponentially. For BIT, it
accounts for as much as 73.2% of the total time for 512 nodes.
The "reducing" times also increase because the imbalance of
a graph becomes more severe as the graph becomes larger;
the local "traversing" times remain more or less the same
because the problem size per node is fixed. At 512 nodes, WAH
reduces the "communication" time by 52.4% compared to
BIT; DIR-WAH reduces the "communication" time by another
55.9% compared to WAH, achieving a total 79.0% reduction
compared to BIT, from 18.6 seconds to 3.9 seconds. On
one hand, the "compression & sieve" time of WAH (only
compression time is counted for WAH) at 512 nodes is less
than 0.1% of the total run time and not shown in the figure.
This means the benefit of compression comes at very little cost.
On the other hand, the time of "compression & sieve" in DIR-WAH
(the computing time traded for bandwidth) accounts
for 11.1% of the total. This is because Algorithm 3 (line 9)
needs to copy the frontier for each process before sieving. This
copying time is expensive because it is in direct proportion
to the number of processes. Overall, comparing DIR-WAH to
WAH (512 nodes), sieve costs about 1.3 seconds but saves 5.0
seconds in communication, so the saving is worth the cost.

Label                  Description
traversing             Local sparse matrix vector multiplication
reducing               MPI reduction to get vertex sum of current frontier
communication          Time spent on communication
compression & sieve    Time spent on compression and sieve
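
The figures quoted for 512 nodes are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below only recombines numbers stated in the text; the helper name is ours.

```python
def combined_reduction(*reductions):
    """Total reduction after applying several relative reductions in turn."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Compression cuts communication by 52.4%, sieve by a further 55.9%.
total = combined_reduction(0.524, 0.559)

# The same reduction computed from the reported times (18.6 s to 3.9 s).
from_times = 1.0 - 3.9 / 18.6

# Communication time of WAH implied by the 52.4% reduction, and the
# saving of DIR-WAH relative to it.
wah_comm = 18.6 * (1.0 - 0.524)
sieve_saving = wah_comm - 3.9

# Speedups multiply: 1.69x from compression, another 1.33x from sieve.
overall_speedup = 1.69 * 1.33

print(f"combined reduction: {total:.1%}")
print(f"from times:         {from_times:.1%}")
print(f"sieve saving:       {sieve_saving:.2f} s")
print(f"overall speedup:    {overall_speedup:.2f}x")
```

Both routes give about 79.0%, the implied saving is close to the quoted 5.0 seconds, and the product of the two speedups is close to the reported 2.24x.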

Figure 8 plots the performance of different BFS algorithms
at different scales. The experiment runs on 512 nodes. We can
learn from this plot that the compression and sieve method
favours larger messages. The size of the messages affects the
results: at scale 26, DIR-WAH, WAH and BIT need to exchange
an 8MB bitmap globally using MPI collective communications;
at scale 33, they exchange 1GB. DIR-WAH is the slowest when the scale
is small, but it gradually catches up and surpasses all other
algorithms when the scale gets bigger.

Fig. 8. Performance of different BFS algorithms at different scales. The
experiment runs on 512 nodes.

Fig. 9. Weak scaling performance of BFS with different compression
methods. The experiment uses a fixed problem size per node (each node has
about 16M vertices).

As mentioned in Section IV, different methods could be
used for compression. We did not implement all of them but
chose two, the Zlib library [23] and WAH, for the following
reasons: Zlib is known for good compression on
a wide variety of data and provides different compression
levels; WAH is dedicated to bitmap compression, simpler than
PLWAH and faster than BBC. We use Zlib 1.2.6 with three
different compression levels: best compression (ZLB-BC), best
speed (ZLB-BS) and default (ZLB-DF). The results are plotted
in Figure 9 and Figure 10.

Figure 9 shows the weak scaling performance of BFS
algorithms with different compression and sieve methods. BFS
with Zlib best compression (ZLB-BC) is the slowest. With 512
nodes, DIR-WAH provides the best performance, followed by
ZLB-BS (69.9% of DIR-WAH), DIR-ZLB-BS (66.7%), ZLB-BC
(53.5%), and DIR-ZLB-DF (39.7%), respectively.

Fig. 10. Time profiling of different compression implementations.

Figure 10 shows the time breakdown of these algorithms.
At scale 33 with 512 nodes, DIR-ZLB-DF's "communication"
time is the smallest, 0.82x that of DIR-WAH, followed by
DIR-ZLB-BS (0.83x), DIR-ZLB-BC (1.23x), ZLB-BS (1.57x) and
ZLB-BC (1.61x). Although DIR-ZLB-DF's and DIR-ZLB-BS's
communication times are less than DIR-WAH's, their "compression
and sieve" times are 14.25x and 5.44x that of DIR-WAH. So
the overall performance of DIR-ZLB-DF and DIR-ZLB-BS is
worse than that of DIR-WAH. Of the three compression levels in Zlib
we tested, the default method, not the best compression method,
provides the best compression ratio. In fact, the Zlib best
compression method is not suited for bitmap compression: it is
not only the slowest, but also provides the worst compression
ratio.

VIII. Related Works 

Several different approaches have been proposed to reduce the 
communication in distributed BFS. Yoo et al. [4] run 
distributed BFS on IBM BlueGene/L with 32,768 nodes. Its 
high scalability is achieved through a set of memory and 
communication optimizations, including a two-dimensional 
partitioning of the graph to reduce communication overhead. 
Buluç and Madduri [5] improved Yoo et al.'s work by adding 
hybrid MPI/OpenMP programming to optimize computation 
on state-of-the-art multicore processors, and managed to run 
distributed BFS on a 40,000-core machine. The method of 
two-dimensional partitioning reduces the number of processes 
involved in collective communications. Our algorithm reduces 
the communication overhead in a different way: minimizing 
the size of messages with compression and sieve. Moreover, 
these two optimizations could be combined to further reduce 
the communication cost in distributed BFS. A 
preliminary result is presented in Section IX to demonstrate 
its potential. Beamer et al. [24] use a hybrid top-down and 
bottom-up approach that dramatically reduces the number of 
edges examined. The sample code in Graph 500 uses a bitmap 
(bitset array) in communication, reducing its message size. 
Cong et al. [12] apply communication coalescing in a PGAS 
implementation to minimize message overhead. 
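
To illustrate why two-dimensional partitioning shrinks the collectives, 
the following Python sketch lists the ranks that share a grid row or a 
grid column with a given process. The row-major layout and the function 
name grid_groups are assumptions made for illustration; they are not 
taken from [4] or [5]. 

```python
def grid_groups(rank, rows, cols):
    """Return (row_group, col_group) for `rank` in a rows x cols process grid.

    Ranks are assumed to be laid out row-major, so rank = r * cols + c.
    """
    if not 0 <= rank < rows * cols:
        raise ValueError("rank outside the process grid")
    r, c = divmod(rank, cols)
    row_group = [r * cols + j for j in range(cols)]   # peers in the same row
    col_group = [i * cols + c for i in range(rows)]   # peers in the same column
    return row_group, col_group


if __name__ == "__main__":
    row_group, col_group = grid_groups(rank=5, rows=4, cols=4)
    # With 16 processes, each collective spans 4 ranks instead of all 16.
    print(row_group, col_group)
```

With p processes in a square grid, a row or column collective spans 
about sqrt(p) ranks, whereas a collective under one-dimensional 
partitioning spans all p. 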

Benchmarks, algorithms and runtime systems for graph 
algorithms have gained much popularity in both academia 
and industry. Earlier works on Cray XMT/MTA [25], [26] 
and IBM Cyclops-64 [8] prove that both massive threads and 
fine-grained data synchronization improve BFS performance. 
Bader and Madduri [25] designed a fine-grained parallel BFS 
which utilizes the support for hardware threading and 
synchronization provided by MTA-2, and ensures that the graph 
traversal is load-balanced to run on thousands of hardware 
threads. Mizell and Maschhoff [26] discussed an improvement 
on Cray XMT. Using a massive number of threads to hide 
latency has long been employed in these specialized multi-threaded 
machines. With the recent progress of multi-core and SMT, 
this technique can be popularized to more commodity users. 
Both core-level parallelism and memory-level parallelism are 
exploited by Agarwal et al. [27] for optimized parallel BFS 
on Intel Nehalem EP and EX processors. They achieved 
performance comparable to special-purpose hardware like 
Cray XMT and Cray MTA-2 and first identified the capability 
of commodity multi-core systems for parallel BFS algorithms. 
Scarpazza et al. [28] use an asynchronous algorithm to 
optimize communication between SPE and SPU for running BFS 
on STI CELL processors. Leiserson and Schardl [29] use the 
Cilk++ runtime model to implement parallel BFS. Cong et 
al. [12] present a fast PGAS implementation of distributed 
graph algorithms. Another trend is to use GPUs for parallel 
BFS, for they provide massively parallel hardware threads 
and are more cost-effective than the specialized hardware. 
Generally, GPUs are good at regular problems with contiguous 
memory accesses. The challenge of designing an effective 
BFS algorithm on GPU is to solve the imbalance between 
threads and to hide the cost of data transfer between CPU and 
GPU. Several works [30], [31], [32] pursue this 
direction. 

IX. Conclusion 

The main purpose of this paper is to reduce the commu- 
nication cost in distributed breadth-first search (BFS), which 
is the bottleneck of the algorithm. We found two problems 
in previous distributed BFS algorithms: first, their message 
formats are not condensed enough; second, broadcasting mes- 
sages causes waste. We propose to reduce the message size 
by compressing and sieving. By compressing the messages, 
we reduce the communication time by 52.4%. By sieving the 
messages with a distributed directory before compression, we 
reduce the communication time by another 55.9%, achieving 
a total 79.0% reduction in communication time and a 2.2x 
performance improvement over the baseline implementation. 
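
As a consistency check rather than an additional measurement, the three 
percentages compose as follows, assuming that "another 55.9%" applies to 
the communication time remaining after compression: 

$$
1 - (1 - 0.524)(1 - 0.559) = 1 - 0.476 \times 0.441 \approx 1 - 0.210 = 0.790
$$

This agrees with the reported total reduction of 79.0%. 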

For future work, we would like to combine our 
optimization of message size with other methods such as 
two-dimensional partitioning [5] and the hybrid top-down and 
bottom-up algorithm [24]. The potential is clear. A preliminary 
optimization of the distributed BFS algorithm in the Combinatorial 
BLAS library [33], compressing the sparse vector using the Zlib 
library, reduces the communication time by 41.9% and 
increases overall performance by 1.11x. By using a compressed 
bitmap and adding the sieve, we expect to further improve its 
performance. 